Abstract



This paper describes a probabilistic multiple-hypothesis framework for tracking 
highly articulated objects. In this framework, the probability density of the tracker 
state is represented as a set of modes with piecewise Gaussians characterizing the 
neighborhood around these modes. The temporal evolution of the probability density is 
achieved through sampling from the prior distribution, followed by local optimization 
of the sample positions to obtain updated modes. This method of generating hypothe- 
ses from state-space search does not require the use of discrete features unlike classical 
multiple-hypothesis tracking. The parametric form of the model is suited for high- 
dimensional state-spaces which cannot be efficiently modeled using non-parametric 
approaches. Results are shown for tracking Fred Astaire in a movie dance sequence. 
1 Introduction

Visual tracking of human motion is a key technology in a broad range of applications 
from user-interfaces to video editing. This paper addresses the problem of figure tracking,
using a known kinematic model to describe the skeletal constraints [17, 12, 28, 22].
The kinematics of an articulated object provide the most fundamental constraint on its 
motion. Kinematic models play two roles in tracking. First, they define the desired 
output — a state vector of joint angles that encodes the degrees of freedom of the model. 
Second, they specify the mapping between states and image features that makes regis- 
tration possible. 

A key attribute of any tracking scheme is the choice of probabilistic representation 
for the state estimates. The Kalman filter [2] is a classical choice which has been 
employed in earlier figure tracking work (see [18, 15, 25] for examples). Unfortunately 
the Kalman filter is restricted to representing unimodal probability distributions. The 
presence of background clutter, self-occlusions, and complex dynamics during figure 
tracking results in a state-space probability density function (pdf) which is multi-modal.

Multiple hypothesis tracking (MHT) is a classical approach to representing multi- 
modal distributions with Kalman filters [4]. MHT methods have been used with great 
effectiveness in radar tracking systems, for example. They are designed to process 
a discrete set of measurements, such as radar returns, at each time instant. A repre- 
sentative approach is Reid's algorithm [24], which employs a bank of Kalman filters 
to evaluate the combinatoric assignments between discrete measurements and targets. 
Unfortunately, for visual tracking applications with complex targets the requisite "sen- 
sor" typically does not exist. For example, there is no generic figure detector which 
takes an input image and outputs a set of figure measurements, where each measure- 
ment is a different possible skeletal configuration. 

An alternative to classical MHT is the class of Monte Carlo methods such as Isard 
and Blake's CONDENSATION algorithm [13]. These techniques employ a nonpara- 
metric sample-based representation of the pdf which can model arbitrary densities. 
These methods have the disadvantage that the required number of samples grows ex- 
ponentially with the size of the state space. As a consequence, an accurate dynamic 
model is required in practice to reduce the number of samples needed for accurate 
modeling. These factors make nonparametric techniques less attractive for objects like 
the human figure, which have both a large state space and complex dynamics. 

This paper describes a novel MHT formulation which is suitable for figure tracking. 
The key idea is to explicitly model and track the modes in the state pdf. Our approach 
is based on the hypothesis that visually complex targets such as the human figure will 
typically have a small number of well-defined modes in their posterior density. We
use a sampling-based state space search process to generate a set of hypotheses cor- 
responding to the local maxima in the likelihood function. By generating hypotheses 
through state space search we avoid the need for the explicit figure detector required 
by classical MHT methods. By focusing our probabilistic representation on the modes 
of the distribution we avoid the explosion in the number of samples that a Monte-Carlo 
scheme requires. A more detailed comparison can be found in section 5.1. This work 
is the first application of multiple hypothesis techniques to figure tracking. 

2 A Kinematic Model for Figure Registration 



Most of the previous work on articulated object tracking has focused on the use of 3-D
kinematic models to estimate the detailed 3-D motion of hands and figures. These ap- 
proaches require multiple camera viewpoints for accurate estimation and rarely operate 
on live video (one exception is [22]). In contrast, there are many applications of figure 
tracking in which only a single camera input is available. One example which moti- 
vates this report is the recovery of human motion from movie footage. Another class 
of examples are vision-based user-interfaces which could benefit from coarse measure- 
ments of body pose suitable for gesture recognition, but are unlikely to require accurate 
3-D pose recovery. 

This report addresses figure registration, which is the estimation of 2D image plane
figure motion across a video sequence. Figures are described by a novel class of 2D 
kinematic models called Scaled Prismatic Models (SPM), introduced in [16]. These 
models enforce 2D constraints on figure motion that are consistent with an underlying 
3D kinematic model. Unlike 3D kinematic models, SPM's do not require detailed prior 
knowledge of figure geometry and do not suffer from singularity problems when they 
are used with a single video source. 

Each link in a scaled prismatic model describes the image plane appearance of an 
associated rigid link in an underlying 3D kinematic chain. Each SPM link can rotate 
and translate in the image plane, as illustrated in Figure 1. The link rotates at its joint
center around an axis which is perpendicular to the image plane. This captures the 
effect on link orientation of an arbitrary number of revolute joints in the 3D model. The 
translational degree of freedom (DOF) models the distance between the joint centers 
of adjacent links. It captures the foreshortening that occurs when 3D links rotate into 
and out of the image plane. This DOF is called a scaled prismatic joint because in 
addition to translating the joint centers it also scales a template representation of the 
link appearance. 



Figure 1: The effect of revolute ($\theta$) and prismatic ($d$) DOF's on one link from a 2D
SPM chain. The arrows show the instantaneous velocity of points along the link due to 
an instantaneous state change. 

A complete discussion of SPM models, including a derivation of the SPM Jacobian 
and an analysis of its singularities, can be found in [16]. In this report we model the 
figure as a branched SPM chain. Each link in the arms, legs, and head is modeled as an 
SPM link. Each link has two degrees of freedom, leading to a total body model with 
19 DOF's. The tracking problem consists of estimating a vector of SPM parameters 
for the figure in each frame of a video sequence, given some initial state.

3 Mode-based Multiple-Hypothesis Tracking 

The central goal of a probabilistic tracking framework is to evolve the probability dis- 
tribution of the tracker state over time. Our approach is based on a parametric repre- 
sentation of the modes (local maxima) of the probability density function (pdf) which 
describes the uncertainty in the state. We use Gaussian kernels to model the pdf in the 
local neighborhood surrounding each mode and to interpolate between modes. Kernel 
functions provide a compact description of the pdf, in contrast to the large number of 
samples a non-parametric method would employ in modeling each mode. Each kernel 
can be viewed as a hypothesis about the tracker state, establishing a connection with 
classical MHT methods. 

Our adoption of a mode-based representation is based on two assumptions: that 
the underlying pdf has well-defined modes, and that these modes capture the essential 
structure of the pdf which is required for accurate tracking. We believe this is a reason- 
able assumption for complex visual targets like the figure, and the experimental results 
we present in section 4 support this hypothesis. 

The key step in our mode-based tracking algorithm is a technique, called sample 
refinement, for updating the modes of the pdf given an input image. Sample refinement 
uses sampling from a prior distribution to search for peaks in the likelihood function. 
It is described in detail in section 3.3. 

Our tracking algorithm consists of a series of three steps which are linked through 
Bayes' Rule:

$$p(x_t \mid Z_t) = k \, p(z_t \mid x_t) \, p(x_t \mid Z_{t-1}) \tag{1}$$

where $x_t$ is the tracker state at time $t$, $z_t$ is the observed data, $Z_t$ is the aggregation of
past image observations (i.e. $z_\tau$ for $\tau = 0, \ldots, t$), and $k$ is a normalization constant.
Furthermore $z_t$ is assumed to be conditionally independent of $Z_{t-1}$ given $x_t$.
The stages of the algorithm at each time-step t are: 

1. Prediction: The prior density $p(x_t \mid Z_{t-1})$ is generated by passing the modes of
$p(x_{t-1} \mid Z_{t-1})$ through the Kalman filter prediction step.

2. Likelihood computation: This involves:

(a) Creating initial hypothesis seeds by sampling the prior distribution $p(x_t \mid Z_{t-1})$.

(b) Refining the samples through differential state-space search to obtain the
modes (local maxima) of the likelihood function $p(z_t \mid x_t)$.

(c) Measuring the local statistics associated with each likelihood mode using
perturbation analysis.

3. Posterior Update: The posterior density $p(x_t \mid Z_t)$ is computed via Bayes' Rule
(equation 1) and the set of modes is updated.

Each of these stages outputs a multimodal description of the state pdf. The piece- 
wise Gaussian representation which we employ in modeling the multimodal pdfs is 
described in section 3.1. Sections 3.2, 3.3, and 3.4 describe the three stages of our 
algorithm in detail. 

3.1 Piecewise Gaussian Kernels 

The use of kernels to model probability density functions is a classical problem in 
statistics and a wide variety of solutions are available (see [11] for a recent survey). 
Our current approach is based on the use of Gaussian kernels to model the pdf in the 
immediate vicinity of each mode. To describe the pdf in the state space regions between 
the modes we use the max function to select the kernel with the highest likelihood. This 
leads to a piecewise Gaussian representation of the pdf. One advantage of this approach 
is its computational simplicity, since the kernel parameters are determined entirely by 
local pdf values. This approach is similar in spirit to locally-weighted regression [3]. 

We can define the piecewise Gaussian representation as follows: Given a set of
$N$ modes for which the $i$th mode has a state $m_i$, an estimated covariance $S_i$ and a
probability $p_i$, an accurate construction of the probability density function requires a
local maximum of value $p_i$ located at each $m_i$, with the local neighborhood surrounding
$m_i$ being approximately Gaussian with covariance $S_i$.

Given locally fitted Gaussian kernels, one might be tempted to combine them into a 
Gaussian sum representation by direct superposition. When the modes occur in clumps 
(which happens frequently) this approach will produce errors, as figure 2 illustrates. A 
simple example of four hypotheses in a 1-D state-space is shown in figure 2(a). If the 
hypotheses are summed the combined pdf has only two modes, as shown in figure 2(b). 
This results in a cluster of weaker modes being over-represented at the expense of 
strong but isolated modes. 




Figure 2: (a) shows four recovered modes of a probability distribution together with 
local statistics. Using a Gaussian sum approximation with components located at the 
hypotheses would produce the distribution shown in (b), which has only two modes, 
and also the dominant mode is formed from the cluster of weaker modes. The modes 
and local variances are however preserved if a piecewise Gaussian approximation is 
used (c). 

We employ a Piecewise Gaussian (PWG) representation where the probability den- 
sity p(x) at a point x in the state-space is determined by the Gaussian component 
providing the largest contribution at x, ie. 

$$p(x) = k \max_{i = 1, \ldots, N} \left[ p_i \exp\left( -\tfrac{1}{2} (x - m_i)^T S_i^{-1} (x - m_i) \right) \right] \tag{2}$$

where $k$ is a normalization constant.

If for the previous example a PWG representation is employed, as illustrated in fig- 
ure 2(c), the strengths of each of the modes are preserved. This is preferable since the 
representation is then consistent with the local statistics determined for each hypothe- 
sis. 
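
To make the contrast between the two representations concrete, the following minimal sketch evaluates both in a 1-D state space. It assumes the unnormalized form of (2) with k = 1; the mode values and all function names are hypothetical and chosen only for illustration.

```python
import math

def gaussian_kernel(x, mean, var):
    """Unnormalized Gaussian bump with peak value 1 at x == mean."""
    return math.exp(-0.5 * (x - mean) ** 2 / var)

def pwg_density(x, modes):
    """Unnormalized PWG value at x: the largest weighted kernel, as in (2) with k = 1."""
    return max(p * gaussian_kernel(x, m, v) for (m, v, p) in modes)

def sum_density(x, modes):
    """Unnormalized direct superposition of the same kernels, for comparison."""
    return sum(p * gaussian_kernel(x, m, v) for (m, v, p) in modes)

# Hypothetical modes as (mean, variance, probability):
# three weak modes in a clump and one strong isolated mode.
modes = [(0.0, 1.0, 0.2), (0.5, 1.0, 0.2), (1.0, 1.0, 0.2), (6.0, 1.0, 0.4)]

cluster_pwg = pwg_density(0.5, modes)    # value at the middle of the clump
isolated_pwg = pwg_density(6.0, modes)   # value at the isolated mode
cluster_sum = sum_density(0.5, modes)
isolated_sum = sum_density(6.0, modes)
```

With these illustrative values the superposition assigns the clump a larger value than the isolated mode, whereas the PWG form keeps each peak at its own probability.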

An accurate Gaussian sum representation could be obtained through a more com- 
plex and costly fitting process using the EM algorithm [21]. However, we have found 
that the PWG representation provides satisfactory approximation at a greatly reduced 
cost of fitting. This representation does have two disadvantages: Sampling from the 
PWG representation and propagating it through a dynamic model are not as straight- 
forward as they would be for a Gaussian mixture model. These points are discussed 
further in sections 3.2 and 3.3.1. 



3.2 Prediction 

The prior density $p(x_t \mid Z_{t-1})$ in the next time frame is obtained by applying the Kalman
filter prediction step to each of the modes of the posterior distribution $p(x_{t-1} \mid Z_{t-1})$ in
the previous time frame. A dynamical model predicts the new locations of the modes, 
while the covariances of the Gaussian components are increased according to the pro- 
cess noise. The amount of process noise is determined by the accuracy of the dynamical 
model. This process is illustrated in Figures 3(a) and (b), which show a 1-D distribution 
before and after prediction. This formulation may also be viewed as an approximation
to the result $p(x_t \mid Z_{t-1}) = \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid Z_{t-1}) \, dx_{t-1}$, where $p(x_t \mid x_{t-1})$ is a
Gaussian centered on the new mode with covariance equal to the process noise covariance.
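
As an illustration of the per-mode prediction step, here is a minimal sketch for a single scalar degree of freedom under a constant velocity model. The (position, velocity) state layout, the function name `predict_mode`, and the noise values are assumptions made for this example only.

```python
def predict_mode(mean, cov, q, dt=1.0):
    """Kalman prediction for one mode with state (position, velocity).

    mean: (pos, vel); cov: 2x2 nested list; q: 2x2 process noise covariance.
    Uses the transition matrix F = [[1, dt], [0, 1]].
    """
    pos, vel = mean
    new_mean = (pos + dt * vel, vel)
    (a, b), (c, d) = cov
    # F P
    fp = [[a + dt * c, b + dt * d],
          [c, d]]
    # (F P) F^T
    fpf = [[fp[0][0] + dt * fp[0][1], fp[0][1]],
           [fp[1][0] + dt * fp[1][1], fp[1][1]]]
    # Inflate the covariance by the process noise.
    new_cov = [[fpf[i][j] + q[i][j] for j in range(2)] for i in range(2)]
    return new_mean, new_cov

# Hypothetical mode and process noise.
mean0 = (2.0, 0.5)
cov0 = [[1.0, 0.0], [0.0, 1.0]]
noise = [[0.5, 0.0], [0.0, 0.5]]
new_mean, new_cov = predict_mode(mean0, cov0, noise)
```

Applying the same function to every mode of the previous posterior yields the set of predicted modes; the mode probabilities are carried over unchanged in this sketch.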




Figure 3: Prediction step of MHT algorithm. Two modes of a 1-D density for x (a) are 
extrapolated to the next time step (b). 

A disadvantage of using the PWG representation is that the application of the stan- 
dard Kalman filter steps to individual modes for computing prior and posterior distri- 
butions is only mathematically correct for a Gaussian sum parameterization. However, 
the Kalman filter steps are reasonable approximations in a PWG representation when 
the significant modes of the distribution are well-defined with small local variances. 
This is the situation encountered when observation noise is low and the hypotheses 
represent discrete well-defined ambiguous configurations, as opposed to the situation 
when observation noise is high where ambiguous configurations are fused and contin- 
uous in nature. In the sequences used for testing our tracker, the separate hypotheses 
result from clutter and self-occlusions rather than camera noise. This justifies the use 
of the PWG representation within the tracking framework. 

In the experiments carried out for this paper, we did not use a trained or complex 
dynamical model. The dynamical model employed is simply a naive constant velocity 
predictor, and consequently the process noise applied is very high since the prediction 
is often grossly inaccurate. 



3.3 Likelihood Computation 

At the heart of any visual tracking problem is the computation of the likelihood of 
the observed data given a state model. Data likelihood is the fundamental source of 
multimodality in visual tracking, as there are typically many different state configura- 
tions that are consistent with a given set of image measurements. Our approach is to 
describe the likelihood surface using a collection of Gaussian kernels. This section de- 
scribes the algorithm for computing kernel positions and parameters given a predicted 
state pdf and an input image. 



3.3.1 Hypothesis Sampling 

We employ sampling from the predicted state pdf to generate starting points for the 
local search process that identifies the modes in our likelihood representation. We first 
consider the case of sampling from a single truncated Gaussian. This involves obtaining 
samples from the original Gaussian distribution (eg. we used publicly available code 
based on [1]), followed by discarding the samples which fall outside the truncation 
boundary. This may be continued until a satisfactory number of valid samples have 
been obtained. Figure 4(a) shows a representative outcome of the sampling process for 
the predicted distribution shown in Figure 3(b). 








Figure 4: (a) Samples drawn from the 1-D density of Figure 3(b) are shown as dots. (b)
Each sample seeds a local search of the likelihood function.



The PWG distribution may be equivalently expressed as a union of separate trun- 
cated Gaussians with aligned borders, where the borders denote points for which the 
probability values computed from either Gaussian component on opposite sides of the 
border are the same (i.e. there are no probability discontinuities at the borders).
Sampling from the PWG distribution may therefore be carried out with the following steps:

1. Select the $i$th mode with probability $p_i$ from the set of $N$ modes (using notation
defined in section 3.1). 

2. Obtain a single sample s from the original Gaussian distribution associated with 
the ith mode. 

3. If s lies within the boundaries of the ith mode (i.e. the ith component attains the
maximum in (2) at s), accept the sample; otherwise reject it.

4. Return to step 1 until the required number of accepted samples have been ob- 
tained. 
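
The four steps above can be sketched for a 1-D state space as follows. This is an illustrative sketch rather than the code used in our experiments: the function names, the (mean, variance, probability) tuple layout, and the example modes are assumptions.

```python
import math
import random

def weighted_kernel(x, mode):
    """Weighted Gaussian kernel of one mode, i.e. one candidate term inside the max of (2)."""
    mean, var, p = mode
    return p * math.exp(-0.5 * (x - mean) ** 2 / var)

def sample_pwg(modes, n, rng):
    """Draw n accepted samples from a 1-D PWG density by rejection.

    Returns a list of (mode_index, sample) pairs.
    """
    weights = [p for (_, _, p) in modes]
    accepted = []
    while len(accepted) < n:                                     # step 4: repeat
        i = rng.choices(range(len(modes)), weights=weights)[0]   # step 1: pick a mode
        mean, var, _ = modes[i]
        s = rng.gauss(mean, math.sqrt(var))                      # step 2: sample its Gaussian
        best = max(range(len(modes)), key=lambda j: weighted_kernel(s, modes[j]))
        if best == i:                                            # step 3: inside mode i's region
            accepted.append((i, s))
    return accepted

# Two hypothetical modes; by symmetry their border lies at x = 2.5.
modes = [(0.0, 1.0, 0.5), (5.0, 1.0, 0.5)]
samples = sample_pwg(modes, 200, random.Random(0))
```

Every accepted sample lies on its own mode's side of the border, which is the truncation described above.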

3.3.2 State-Space Search for Likelihood Modes 

Starting with the initial SPM model states obtained from sampling the prior distribution
$p(x_t \mid Z_{t-1})$, the states are optimized locally in order to converge on the modes of the
likelihood $p(z_t \mid x_t)$. This is achieved by maximizing (3), or equivalently by obtaining

$$\arg\min_x \left\{ \sum_u \left( I(u) - T(u, x) \right)^2 \right\}$$

This is in fact identical to differential template registration of the 2D SPM model 
whereby the sum of squared pixel residuals is minimized. For this we employ the 
iterative Gauss-Newton method, which has an advantage of simultaneously recovering 
the local variances associated with the modes. This search process is illustrated in Fig- 
ure 4(b). Arrows show the direction of steepest ascent from each seed point. Note that 
a given mode may attract multiple seed points.
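
A toy version of this refinement step may help fix ideas. The sketch below registers a 1-D template with a single translational state by Gauss-Newton iteration and reports a local variance from the same Jacobian sums. The Gaussian-bump template, its width, the pixel noise variance, and all function names are assumptions for illustration; the actual SPM template and Jacobian are those of [16].

```python
import math

def template(u, x, width=3.0):
    """Hypothetical 1-D template: a Gaussian bump centered at state x."""
    return math.exp(-0.5 * ((u - x) / width) ** 2)

def template_jacobian(u, x, width=3.0):
    """Derivative of the template value at pixel u with respect to the state x."""
    return (u - x) / width ** 2 * template(u, x, width)

def gauss_newton_register(image, x0, sigma2, iterations=20):
    """Refine x to minimize the sum of squared residuals I(u) - T(u, x)."""
    x = x0
    jj = 0.0
    for _ in range(iterations):
        jr = 0.0  # sum over pixels of J * residual
        jj = 0.0  # sum over pixels of J * J (Gauss-Newton curvature term)
        for u, value in enumerate(image):
            r = value - template(u, x)
            j = template_jacobian(u, x)
            jr += j * r
            jj += j * j
        x += jr / jj
    # Local variance of the estimate for pixel noise variance sigma2,
    # computed from the curvature term of the final iteration.
    return x, sigma2 / jj

# Noise-free synthetic "image" generated at the true state 10.3.
image = [template(u, 10.3) for u in range(21)]
x_hat, var_hat = gauss_newton_register(image, x0=9.0, sigma2=0.01)
```

Seeds that start within the basin of the same peak converge to the same state, which is why several seed points can end up at one mode.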

3.3.3 State Probabilities from Image Measurements 

Given the detected modes of the likelihood surface, the final step in computing the 
likelihood model is the estimation of the parameters of the Gaussian modes. This can 
be accomplished using an image noise model which gives the probability that the target 
figure, when correctly represented by an SPM model with state x, generates the image 
observation $z_t$ in the current frame. The model can be written

$$p(z_t \mid x) \propto \prod_u \exp\left( -\frac{(I(u) - T(u, x))^2}{2\sigma^2} \right) \tag{3}$$

where $u$ represents image pixel coordinates, $I(u)$ are the image pixel values at $u$,
$T(u, x)$ are the overlapping template pixel values at $u$ when the SPM model has state
$x$, and $\sigma^2$ is the pixel noise variance (this has to be known a priori or obtained
experimentally). The product is evaluated over all pixels located within the boundaries of
the figure.
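
As a small worked example of this noise model, the sketch below scores two candidate templates against the same pixels and converts the scores into relative mode weights. It assumes independent Gaussian pixel noise; the pixel values, the noise variance, and the normalization of the weights to sum to one are illustrative choices.

```python
import math

def log_likelihood(image, template, sigma2):
    """Log of the product of per-pixel Gaussian terms, up to an additive constant."""
    return -sum((i - t) ** 2 for i, t in zip(image, template)) / (2.0 * sigma2)

def normalized_weights(log_likes):
    """Turn per-mode log-likelihoods into weights that sum to one.

    The maximum is subtracted before exponentiating for numerical stability.
    """
    top = max(log_likes)
    raw = [math.exp(v - top) for v in log_likes]
    total = sum(raw)
    return [r / total for r in raw]

image = [0.0, 1.0, 4.0, 1.0, 0.0]
template_a = [0.0, 1.0, 4.0, 1.0, 0.0]   # perfect match
template_b = [1.0, 4.0, 1.0, 0.0, 0.0]   # same pattern shifted by one pixel
sigma2 = 4.0

log_likes = [log_likelihood(image, t, sigma2) for t in (template_a, template_b)]
weights = normalized_weights(log_likes)
```

Working with log-likelihoods avoids underflow when the product runs over the many pixels of a full figure template.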

The final PWG representation of the likelihood is illustrated in Figure 5 for the
example of Figure 4. The detected modes are shown as black circles. The PWG surface
overlays the "true" likelihood surface, and the seed points are drawn in gray.
Approximation error is greatest where the modes are close together.



Figure 5: Piece-wise Gaussian likelihood model constructed from the detected mode 
positions shown as black circles. The true likelihood (from Figure 4) is shown as a 
dashed line. 

Based on (3), it may be observed that the likelihood can be maximized by minimizing
$\sum_u (I(u) - T(u, x_t))^2$. This is achieved through template registration, which may be
considered equivalent to recovering the local maximum likelihood solution.

3.4 Posterior Update 

Computing the posterior density via (1) involves the multiplication of the prior density 
$p(x_t \mid Z_{t-1})$ and likelihood $p(z_t \mid x_t)$ functions, where both functions are represented
in PWG forms as described in the previous sections. The posterior density may be 
approximated by taking pairs of modes from the prior and likelihood distributions and 
multiplying the Gaussians independently. This may be further trimmed by selecting 
only the dominant posterior modes. 
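
The pairwise multiplication can be illustrated in 1-D, where the product of two Gaussian kernels is again a Gaussian kernel with a closed-form mean, variance, and peak value. The sketch below is not the procedure used in our experiments (see the next paragraph); the function names, the example modes, and the final renormalization of the retained weights are assumptions.

```python
import math
from itertools import product

def multiply_modes(a, b):
    """Product of two weighted 1-D Gaussian kernels given as (mean, variance, peak)."""
    (m1, v1, p1), (m2, v2, p2) = a, b
    v = 1.0 / (1.0 / v1 + 1.0 / v2)          # precisions add
    m = v * (m1 / v1 + m2 / v2)              # precision-weighted mean
    p = p1 * p2 * math.exp(-0.5 * (m1 - m2) ** 2 / (v1 + v2))  # peak of the product
    return (m, v, p)

def posterior_modes(prior, likelihood, keep):
    """Multiply every prior mode with every likelihood mode and keep the strongest."""
    pairs = [multiply_modes(a, b) for a, b in product(prior, likelihood)]
    pairs.sort(key=lambda mode: mode[2], reverse=True)
    top = pairs[:keep]
    total = sum(p for (_, _, p) in top)
    return [(m, v, p / total) for (m, v, p) in top]

# One broad hypothetical prior mode and two likelihood modes.
prior = [(0.0, 4.0, 1.0)]
likelihood = [(1.0, 1.0, 0.6), (8.0, 1.0, 0.4)]
posterior = posterior_modes(prior, likelihood, keep=2)
```

The likelihood mode far from the prior is heavily down-weighted, which is how an accurate predictor would suppress implausible hypotheses.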

However in our experiments, the posterior density is taken to be identical to the 
likelihood. This simplification is acceptable because we used a simple constant velocity 
predictor with correspondingly high process noise. The modes of the likelihood are the 
dominant factors in this case. If a superior predictor were available, better results could 
be obtained by modeling the posterior density more accurately. 

3.4.1 Posterior versus Likelihood Distributions 

An important point to note is that while the posterior density incorporates all available
information at the end of each time frame, it may also be useful to retain the likelihood 
distributions as well. This is true when for example an off-line process is available 
to refine the tracking using further knowledge not available to the original tracker. 
This refinement may be achieved with more advanced target and dynamical models as 
well as using the observations in batch mode rather than in a sequential manner. For 
example in the figure tracking scenario, a 3D kinematic model with angular and length 
constraints may be employed off-line to improve on the initial tracking made with a 
2D SPM model; additionally more accurate 2D or 3D dynamical models may be used 
to improve the tracking made using a simple constant velocity model. The original 
posterior distributions should not be used as input since they incorporate erroneous
prior knowledge which is superseded when the improved models are used. Instead, the 
likelihood distributions can be used as input because they encode solely the information 
obtained from observations made within each time frame. 

4 Experimental Results 

The algorithm was tested on three sequences involving Fred Astaire from the movie 
'Shall We Dance'. A 2D 19-DOF SPM model is manually initialized in the first image 
frame, after which tracking is fully automatic. The augmented state-space in this case 
has 38 dimensions because the predictor used is a second order auto-regressive (AR) 
model. Typically the joint probability distribution in the state-space is described via 10 
modes in a PWG representation. 

In fig. 6, three key frames from an original sequence of eighteen frames are shown, 
together with the results obtained from using a single mode tracker. Here the stick 
figure denotes the current state of the tracker. It can be observed that the tracker fails to 
cope with the ambiguity resulting from self-occlusion when Fred Astaire's legs cross. 

In fig. 7, the multiple modes of the tracker are shown in the top row. The bottom row 
shows the dominant mode at each frame, which is solely determined via minimum pixel 
squared residual error. This shows the ability of the tracker to handle the ambiguities 
of self-occlusion by maintaining multiple modes, without even the need for a complex 
dynamical model. 




Figure 6: Single Mode Tracking Results. Top row: three frames from the original 
sequence. Bottom row: the single-hypothesis tracker fails to handle the self-occlusion 
caused by Fred Astaire's legs crossing. 

However, the computational cost of using multiple modes increases at least linearly 
with the number of modes. In the above case, the single-mode tracker completed the 
tracking sequence of 18 frames in about 18 seconds. The 10-mode tracker required 
approximately 2 minutes. Nevertheless, the gain in tracker stability outweighs the
additional computational cost.
Figure 7: Mode-based Multiple Hypothesis Tracking Results. Top row: the multiple
modes of the tracker are shown. Bottom row: the dominant mode is shown, which
demonstrates the ability of the tracker to handle ambiguous situations and thus survive
the occlusion event.

5 Previous Work 

The first works on articulated 3D tracking were [17, 12]. Yamamoto and Koshikawa [28] 
were the first to apply modern kinematic models and gradient-based optimization tech- 
niques, but their results were limited to 2D motion. Other 3D tracking works include 
[22, 23, 10, 5]. The work of Ju et al. [14] is perhaps the closest to our 2D SPM.
Other 2D figure tracking results can be found in [27] . 

Early applications of Kalman filters (KF) to rigid body tracking appear in [6, 26, 9]. 
Figure tracking schemes which use the Kalman filter are discussed in [18, 15]. All 
of these works employ the conventional unimodal KF. One exception is Shimada et al.
[25], in which a simple multiple hypothesis approach is used to handle reflective
ambiguity under orthographic projection. 

The first applications of classical multiple hypothesis tracking techniques to com- 
puter vision problems appeared in [8, 7]. An early survey of these techniques can be 
found in [19]. Recently, Rasmussen and Hager [20] used the joint probabilistic data 
association filter (JPDAF) [4] to track multi-part objects, such as a face and hand. In 
contrast to our MHT framework, the JPDAF approach uses a correspondence-based 
framework for generating hypotheses. Each target is influenced by a linear combina- 
tion of the resulting measurements. 

5.1 Comparisons to Classical MHT and Monte Carlo Methods 

Multiple hypothesis tracking was originally developed for radar tracking systems where 
the measured features are a set of discrete 'blips'. The multiple hypotheses are gen- 
erated by postulating associations between a single target and each of the different 
features. In the case of figure tracking there is however no detector for the human fig- 
ure which explicitly returns features giving different probable skeletal configurations 
in each image frame. One possible solution would be to consider all combinations of 
lower-level features, e.g. edges obtained from an edge detector, which form high-level
'figure features'. However in scenes with significant clutter, this rapidly leads to an 
almost intractable number of hypotheses [8, 7]. More importantly, discrete features 
are not suitable to a large class of problems. For example when using models based 
on appearance or optic-flow, the data association between the model and image pixels 
is both probabilistic and continuous - every different set of pixels is a separate fea- 
ture with a corresponding probability of association to the model. In these instances, 
classical MHT methods are not applicable. 

Instead of using a separate feature-detection process based on image correspon- 
dences, our formulation of hypothesis sampling and local state-space search recovers 
MH states as part of the tracking process. This method is also capable of coping with 
the above-mentioned problems for which the feature set is continuous. The multiple
hypotheses in our method are not simply data-association hypotheses between target 
and features, but state-space hypotheses which locally maximize the likelihood of the 
observed image. 

Alternatively Monte Carlo methods, such as the CONDENSATION algorithm [13], 
can be used. These methods express the pdf of the tracker state non-parametrically 
with a fair set of samples. The number of samples required for accurately modeling 
the pdf increases with both the dimensionality of the state space and the variance of the 
pdf, which in the case of tracking is inversely related to the accuracy of the predictor. 
In our case with 38 state-space dimensions and a weak constant velocity predictor, a 
prohibitive number of samples will be required for reliable tracking with CONDEN- 
SATION. A further problem with the sample-based pdf representation is that only the 
moments of the pdf can be recovered easily. Hence for example while it may be simple 
to compute the mean state, the maximum likelihood (ML) estimate may not be found 
accurately, and more significantly the maximum a posteriori (MAP) estimate is difficult
to compute. 

Our approach copes with weak predictors and high-dimensional state spaces by 
carrying out sample refinement. This allows successful tracking to be achieved with 
only ten samples. Furthermore because a parametric representation is used throughout 
the entire process, both the MAP and ML estimates can be recovered easily. 

6 Conclusions and Future Work 

We have introduced a novel multiple hypothesis tracking algorithm for complex targets 
with high dimensional state spaces. The key insight is to represent and track the modes 
in the posterior state density function. These modes are likely to be sparse and sepa- 
rated for visually complex targets such as the human figure. Experimental results from 
tracking one of Fred Astaire's dance sequences demonstrate the superior performance
of our MHT approach over a standard Kalman filter. 

In the near future we will present an experimental comparison with the
CONDENSATION algorithm. We also plan to extend our MHT framework to handle
self-occlusions and motion discontinuities in an explicit manner. We will also be
investigating the integration of figure tracking with background modeling as well as 
figure-background segmentation. 